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Abstract 

Seismograms  predicted  from  acoustic  or  elastic  earth  models  depend 
very  nonlinearly  on  the  long  wavelength  components  of  velocity.  This 
sensitive  dependence  demands  the  use  of  special  variational  principles  in 
waveform-based  inversion  algorithms.  The  differential  semblance  varia- 
tionaJ  principle  is  well-suited  to  velocity  inversion  by  gradient  methods, 
since  its  objective  function  is  smooth  and  convex  over  a  large  range  of 
velocity  models.  An  extension  of  the  adjoint  state  technique  yields  an 
accurate  estimate  of  the  differential  semblance  gradient.  Nonlinear  conju¬ 
gate  gradient  iteration  is  quite  successful  in  locating  the  global  differential 
semblance  minimum,  which  is  near  the  ordinary  least  squares  global  min¬ 
imum  when  coherent  data  noise  is  small.  Several  examples  based  on  the 
2D  primaries-only  acoustic  model  illustrate  features  of  the  method  and 
its  performance. 
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Introduction 


This  paper  presents  an  implementation  of  a  differential  semblance  optimization 
algorithm  for  estimation  of  velocities  and  reflectivities  from  multioffset  reflec¬ 
tion  seismograms.  Differential  semblance  combines  concepts  from  nonlinear 
least-squares  inversion,  migration  velocity  analysis,  and  travel  time  tomogra¬ 
phy.  It  poses  the  estimation  problem  as  a  variational  principle  of  least-squares 
type.  This  principle  is  defined  in  an  expanded  space  of  experiment-dependent  or 
incoherent  models,  which  contains  the  normal  model  space  as  a  proper  subset. 
A  gradient-based  local  optimization  algorithm  solves  this  problem  efficiently, 
whereas  other  least-error  formulations  of  multioffset  waveform  inversion  appear 
to  require  global  optimization  algorithms  such  as  Monte  Carlo  search  or  simu¬ 
lated  annealing  (Mosegard  and  Tarantola  1991  [21],  Stoffa  and  Sen  1991  [27], 
Scales  et  al.,  1991  [26]).  Nonetheless,  the  solution  of  the  differential  semblance 
optimization  problem  closely  approximates  the  solution  of  the  straightforward 
mean-square  best-fit  formulation  of  velocity  estimation  when  the  amount  of  co¬ 
herent  noise  in  the  data  is  small.  Thus  differential  semblance  optimization  may 
be  regarded  as  an  infeasible  point  method  for  ordinary  least-squares  inversion. 

The  next  section  introduces  the  differential  semblance  approach  as  a  vari¬ 
ant  of  nonlinear  least-squares  inversion,  in  general  terms.  The  following  two 
section  present  the  detailed  formulation  of  differential  semblance  optimization 
for  primaries-only  2D  acoustics,  and  describe  the  gradient  calculation  and  opti¬ 
mization,  respectively.  In  fourth  section  .several  simple  layered  and  nonlayered 
synthetic  examples  illustrate  the  behaviour  of  the  method.  The  algorithm  de¬ 
scribed  here  applies  a  prion  to  multidimensional  models.  The  translational 
symmetry  of  data  from  layered  models  allows  the  extraction  of  all  kinematic 
and  dynamical  information  from  a  small  part  of  the  data,  however.  We  take 
advantage  of  this  redundancy  to  minimize  the  computational  expense  of  inver¬ 
sion  of  layered  models,  without  changing  the  algorithm.  The  inversion  examples 
presented  below  are  all  layered  for  this  reason.  We  also  exhibit  differential  sem¬ 
blance  gradients  for  nonlayered  models,  which  confirm  the  sensitivity  of  the 
objective  function  to  lateral  velocity  heterogeneities. 

The  paper  ends  with  a  discussion  of  migration  velocity  analysis,  travel  time 
tomography,  and  least-squares  inversion  and  shows  how  differential  semblance 
optimization  is  closely  related  to  each  of  these  techniques.  An  appendix  provides 
a  detailed  derivation  of  the  gradient  calculation. 
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The  differential  semblance  concept 


Differential  semblance  optimization  is  a  variant  of  nonlinear  least  squares  in¬ 
version,  in  which  an  earth  model  is  sought  to  minimize  the  difference  between 
predicted  and  measured  data.  Least  squares  inversion  accomodates  a  variety 
of  physical  descriptions  of  seismic  wave  propagation  and  possesses  an  elegant 
information-theoretical  justification  (Tarantola  1987)  [41],  As  a  result  the  tech¬ 
nique  has  received  considerable  attention  in  the  last  years. 

One  of  the  first  results  of  numerical  work  on  least  squares  was  the  discovery 
that  the  mean  square  misfit  function  is  severely  nonconvex  in  the  long  wave¬ 
length  or  smooth  components  of  velocities  (Kolb  et  al.  1986  [17],  Gauthier  et 
al.  1986  [14],  Symes  and  Santosa  1989  [24]).  As  a  result  most  recent  work 
has  either  used  conventional  velocity  analysis  or  travel  time  tomography  to  ad¬ 
just  the  smooth  part  of  the  velocity  (Helgesen  and  Kolb  1989  [16],  Noble  1992 
[22])  or  employed  global  nonconvex  optimization  technology  such  as  simulated 
annealing  or  genetic  algorithms  (Cao  et  al.  1990  [5],  Mosegard  and  Tarantola 
1991  [21],  Stoffa  and  Sen  1991  [28],  [27]  Scales  et  al.  1991  [26]).  The  latter 
techniques  appear  to  require  a  very  large  number  of  simulations  (10®  -  10®  are 
typical  numbers).  They  depend  for  their  success  on  the  presumption  that  a 
“significant”  part  of  model  space  will  have  been  sampled,  but  no  criteria  for 
adequate  sampling  appear  to  be  available. 

The  root  cause  of  the  nonconvexity  of  the  least  squares  objective  is  the 
difficulty  of  fitting  all  gathers  in  a  multi-gather  (i.e.  multi-experiment)  data 
set  simultaneously,  unle.ss  the  velocity  (i.e.  the  kinematical  aspect  of  the  earth 
model)  is  very  nearly  correct.  This  observation  holds  for  all  types  of  data  gathers 
(common  source,  common  receiver,  common  offset,  plane  wave,...).  On  the  other 
hand  fitting  a  single  gather  is  a  relatively  easy  task. 

This  reasoning  suggests  the  introduction  of  an  experiment-dependent  model 
space.  Since  the  actual  earth  is  not  experiment-dependent,  a  suitable  objective 
function  for  inversion  should  be  designed  to  minimize  the  variance  of  the  model 
with  experiment.  Accordingly  we  propose  an  objective  function  containing; 

•  a  least  squares  (or  other  convenient)  data  fitting  term; 

•  a  differential  semblance  term  penalizing  deviations  between  the  models  for 
neighboring  experiments. 

Now  such  an  objective  function  is  just  as  nonconvex  as  is  is  the  usual  least 
squares  objective,  since  the  latter  is  contained  in  the  former.  However  the 
analysis  of  this  pathology  also  suggests  a  way  out.  Recall  that  the  long  wave¬ 
lengths  of  the  velocity  (i.e.  p-wave  velocity,  or  p-wave  and  s-wave  velocities  if 
s-wave  data  is  present,  etc.)  have  a  markedly  different  effect  on  the  data  than 
do  all  other  model  parameters,  such  as  short- wavelength  velocity  fluctuations. 


3 


impedance  contrasts,  etc.  This  dichotomy  of  data  influence  accounts  is  the  main 
source  of  nonconvexity  of  the  least  squares  objective,  mentioned  above.  Differ¬ 
ential  semblance  optimization  separates  the  two  classes  of  parameters,  viewing 
the  long  wavelengths  of  the  velocity  as  “control”  variables  and  the  rest  of  the 
model  as  a  “state”  variable.  The  main  result  of  the  theoretical  and  numerical 
investigation  of  differential  semblance  is: 

If  the  differential  semblance  objective  is  minimized  first  over  the 
“state”  variables,  the  remaining  function  of  the  “control”  variables  is 
smooth  and  convex  in  a  large  domain,  so  can  be  minimized  effectively 
by  gradient-based  optimization  methods. 

We  shall  refer  to  the  long  wavelengths  of  the  velocity,  i.e.  the  “control” 
variables,  as  “the  velocity”,  and  the  rest  of  the  model,  i.e.  the  “state”  variables, 
as  “the  reflectivity”.  Thus  the  two-stage  approach  just  suggested  amounts  to 
inverting  for  reflectivities  given  velocities,  then  stuffing  the  inverted  reflectivities 
into  the  differential  semblance  objective  and  minimizing  the  resulting  function  of 
velocities.  The  separation  into  the  two  classes  can  be  either  implicit  or  explicit. 
In  this  paper  we  explicitly  separate  the  model  into  velocity  and  reflectivity  via 
linearization,  with  the  velocity  used  as  the  background  model.  That  is,  we 
will  adopt  the  primaries-only  approximation.  For  an  example  of  differential 
semblance  with  implicit  separation  and  fully  nonlinear  modeling  (inclusive  of 
multiple  reflections),  see  Symes  1991a  [34]. 

Figure  1  illustrates  the  reasoning  behind  our  main  result.  It  caricatures 
inverted  reflectivities  for  correct  and  incorrect  velocities,  and  displays  slices 
through  the  inverted  reflectivity  hypercube  at  two  adjacent  shot  locations.  In¬ 
verted  reflectivities  from  neighboring  shots  are  similar  in  phase  content  even 
when  the  velocity  model  is  wrong  (provided  that  the  experiment  parameter, 
i.e.  shot  location,  is  sampled  finely  enough).  Therefore  their  differences  change 
slowly  when  the  velocity  model  is  changed.  This  accounts  for  the  smooth  be¬ 
haviour  of  the  differential  semblance  objective. 

Symes  1991a  [34]  and  Symes  and  Carazzone  1991  [37]  give  an  account  of 
an  earlier  implementation  of  differential  semblance  optimization,  specifically 
formulated  for  plane  wave  data  and  layered  earth  models.  This  earlier  version 
also  extracted  reasonable  satisfactory  estimates  of  velocity  and  reflectivity  from 
carefully  selected  and  prepared  field  data  sets. 

The  implementation  presented  here  was  introduced  and  analysed  in  Symes 
1991  [33],  [35]  and  Symes  1992  [36].  It  applies  in  principle  to  virtually  any  class 
of  earth  models  and  presentation  of  the  data.  We  will  limit  our  attention  to 
2D  acoustic  models  and  shot  gather  (common  source)  data  sets.  As  mentioned 
above  we  use  first-order  perturbation  theory  to  view  the  rest  of  the  model  (the 
reflectivity)  as  a  perturbation  about  the  long  wavelengths  of  velocity.  We  regard 
the  data  as  a  measurement  of  the  perturbational  (scattered)  field.  Physically, 
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this  approximation  amounts  to  neglect  of  multiply  reflected  energy.  It  appears  to 
be  roughly  correct  for  some  data  sets,  admissible  after  preprocessing  for  others, 
and  simply  inadequate  for  yet  others.  We  will  also  neglect  density  variations, 
and  regard  the  source  as  known  in  both  waveform  and  radiation  pattern.  These 
last  two  assumptions  are  inessential  to  the  method  -  both  can  be  dropped  at 
the  expense  of  more  complex  implementation.  In  particular,  the  source  wavelet 
and  directivity  can  be  estimated  from  the  data  itself  as  part  of  the  inversion 
(Lewis  1989  [20]).  Since  reflection  amplitudes  in  all  real  data  sets  are  affected 
by  both  density  fluctuations  and  source  details,  we  have  applied  the  constant 
density  implementation  to  synthetic  data  only. 
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Differential  semblance  for  2D  primary  reflection 
acoustics 


This  section  defines  a  mathematically  precise  version  of  differential  semblance 
optimization  for  2D  primaries  only  constant  density  acoustics,  summarizes  some 
of  its  properties,  and  outlines  basic  computational  choices  made  in  our  imple¬ 
mentation. 

We  will  use  a  version  of  the  2D  acoustic  constant-density  primaries-only 
model  of  seismic  wave  propagation.  This  model  assumes  that: 

•  the  earth  is  two-dimensional,  and  occupies  the  half-space  {z  >  0}; 

•  the  acoustic  pressure  field  vanishes  on  the  surface  of  the  earth  (i.e.  on 

{^  =  0}); 

•  density  variations  can  be  neglected; 

•  multiply  reflected  energy  can  be  neglected; 

•  sources  are  temporally  and  spatially  localized,  the  source  characteristics 
are  known,  and  differ  from  shot  to  shot  only  by  position. 

The  model  couples  the  fields 

•  p(x,  z,t;  Xg)  =  (scattered)  pressure  field 

•  po(x,  z,t;  Xg)  =  reference  pressure  field 

•  v(x',z)  =  (smooth)  velocity  field 

•  r(x,z)  =  (rough)  reflectivity  field,  i.e.  relative  velocity  perturbation  (~ 
26v(x, z)lv{x,  z)) 

•  f{x,z,t\Xg)  =  source  (divergence  of  a  body  force  field) 

Here  Xg  ranges  over  a  set  of  source  locations.  The  fields  are  connected  through 
the  set  of  coupled  wave  equations 

=  (2) 

Equation  1  is  the  ordinary  constant-density  acoustic  wave  equation  for  the  ref¬ 
erence  (incident)  pressure  field  po-  First-order  perturbation  theory  plus  a  little 


6 


algebra  produces  equation  2  for  the  perturbation  or  scattered  field  p  in  response 
to  a  perturbation  6v  of  v. 

So  long  as  v  is  smooth,  i.e.  has  a  characteristic  length  scale  of  several 
wavelengths,  the  reference  field  po  contains  only  the  direct  wave  and  refracted 
waves,  and  is  essentially  devoid  of  reflections.  This  well-known  result  follows 
from  geometric  optics  (Friedlander  1958  [13];  for  this  application  see  Claerbout 
1985  [7],  for  example).  The  relative  perturbation  r  =  Sv/v  turns  up  naturally, 
as  a  secondary  source  multiplying  the  Laplacian  of  po  on  the  right-hand  side  of 
the  wave  equation  for  p.  Since  r  is  responsible  for  all  reflections  in  this  model, 
it  is  natural  to  call  it  the  reflectivity. 

To  complete  the  specification  of  the  fields  initial  and  boundary  conditions 
are  necessary.  We  assume  that  the  wave  equations  hold  in  a  spatial  region  R 
and  for  a  time  interval  t  <  t„iax,  and  that  the  fields  are  quiescent  for  negative 
times: 

f{x,z,t;xs)  'j 
Po(x,zJ-,Xs)  V  ~  0,t  <  0 
p{x,z,t;x,}  J 

Ideally  R  is  the  half-space  >  0}.  For  computational  purposes  R  must  be 
a  finite  domain,  which  we  take  to  be  the  rectangle  {xmin  <  x  <  Xmax,^  <  z  < 
^max}-  We  assume  zero  pressure  on  the  surface  {z  =  0}: 

po(a;,0,<;a;,)  1 
p(x,0,<;a;,)  /“ 

It  is  commonplace  to  employ  so-called  absorbing  boundary  conditions  on  the 
other  sides  of  the  rectangle  to  simulate  wave  propagation  in  an  infinite  domain 
(see  eg.  Enquist  and  Majda  1977  [11]).  However  such  conditions  implicitly 
assume  that  no  reflections  occur  outside  R;  if  this  assumption  is  incorrect,  some 
part  of  the  data  may  not  be  explicable  by  the  model.  For  an  example  of  this 
sort  of  effect  in  an  inverse  problem  (see  Noble  1992  [22]).  Instead  we  prefer  to 
bear  the  extra  computational  expen.se  of  putting  the  boundaries  of  R  far  enough 
away  from  the  source  that  no  reflection  can  arrive  in  the  receiver  array  from  the 
boundary  in  the  recording  interval  {0  <  <  <  tmax)-  That  is,  we  assume 

•  that  Po  and  p  are  required  to  vanish  on  a// sides  of  R  for  all  t; 

•  that  the  source  f{x,z,t;Xs)  vanishes  outside  of  a  small  region  near  the 
source  location  Xg,z,,  and  that  the  .source  depth  z,  is  the  same  for  all 
sources; 

•  that  for  each  source  location  Xg  a  set  of  receiver  locations  Xr,  Zr  (“cable”) 
is  given,  and  that  the  receiver  depth  2,-  is  the  same  for  all  source  locations 
Xg  and  receiver  offsets  Xr  —  Xg\ 
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•  that  R  is  sufficiently  large  that  po  takes  the  same  values  as  it  would  if 

calculated  in  the  half-space,  at  all  points  x,-,  ^r  of  the  cable,  for  all  x,,  and 
for  0  ^  ^max  • 

In  practice  it  is  easy  to  check  that  R  is  sufficiently  large,  using  an  average 
value  of  the  slowness  Ijv. 

We  solve  the  wave  equations  1, 2  numerically  using  a  finite  difference  method 
of  fourth  order  in  space  and  second  order  in  time  (Dablain  1986  [8],  Levander 
1988  [19]).  Since  the  boundary  conditions  specify  the  vanishing  of  the  fields  po 
and  p  on  the  boundary  of  R,  the  method  of  images  gives  numerical  boundary 
conditions  (i.e.  one-sided  difference  stencils  near  the  boundary)  of  the  same 
accuracy  as  the  interior  scheme. 

The  simulated  primary  reflection  data  set  consists  of  the  samples  of  the  scat¬ 
tered  field  p  for  various  source  locations  x, ,  as  a;  ranges  over  the  corresponding 
receiver  coordinates  x,.  and  0  <  f  <  tmax-  Since  this  data  set  depends  function¬ 
ally  on  the  fields  v  and  r,  we  write 

F[t;,r]  =  {p(xr,2r,«;x,)}  (3) 

We  call  F  the  forward  map  of  this  model.  Since  the  receiver  positions  do  not 
necessarily  fall  on  the  finite  difference  gridpoints,  numerical  implementation  of 
this  map  requires  an  interpolation  onto  receiver  positions. 

Note  that 

•  F  is  linear  in  r,  since  r  appears  linearly  on  the  right  hand  side  of  2; 

•  F  is  extremely  nonlinear  in  v. 

The  first  property  is  obvious,  and  the  .second  is  by  now  well  established  (Claer- 
bout  1985  [7],  Tarantola  1986  [40],  Santosa  and  Symes  1989  [24],  Symes  1992 
[36]).  The  consequences  for  waveform  inversion  are  very  important.  First,  given 
V,  it  is  relatively  eeisy  to  estimate  r.  Essentially,  estimation  of  r  given  v  requires 
a  before  stack  migration  with  correct  output  amplitudes.  Both  direct  methods 
of  Kirchhoff  type  (Beylkin  1985  [1],  Bleistein  1987  [3])  and  iterative  methods 
based  on  modeling  (Bourgeois  et  al.  1989  [4],  Noble  1992  [22])  work  reasonably 
well.  However  estimation  of  v  and  r  jointly  is  very  difficult,  because  of  the 
second  property.  For  example  the  mean  square  misfit  function 

JLs[v,r]  =  -\\F[v,r]-Fdata\\~  =  :^  ^  \p{Xr,Zr,t;X,)-pdaia{Xr,t-,Xs)\'^  (4) 

Xg  ,Xr,t 

is  highly  non-quadratic  and  does  not  admit  global  minimization  via  gradient 
(local)  optimization  methods.  As  mentioned  in  the  introduction,  several  authors 
have  suggested  stochastic  minimization  of  J/,5. 
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For  the  sake  of  efficiency,  and  for  other  reasons  to  be  discussed  later,  we 
prefer  to  modify  the  problem  and  the  objective  function.  Our  modification 
proceeds  in  two  steps. 

The  first  step  follows  from  the  behaviour  of  J ls  when  only  one  source  (lo¬ 
cation)  is  used.  Then  the  minimization  of  Ji,s  is  actually  quite  straightforward: 
minimization  over  ?’  (the  quadratic  problem)  yields  a  value  essentially  indepen¬ 
dent  of  V.  Thus  minimization  of  J^s  for  a  single  shot  is  easy,  and  does  not 
constrain  v  at  all.  The  difficult  nature  of  minimizing  Jis  in  general  stems  from 
the  necessity  of  fitting  data  from  many  shot  locations  simultaneously,  with  a  sin¬ 
gle  model  [t;,r].  This  circumstance  suggests  relaxing  the  minimization  problem 
by  allowing  the  model  to  depend  on  the  source  location.  Since  v  is  unconstrained 
by  independent  fits  to  the  various  single-source  data  sets,  only  r  actually  need 
be  allowed  to  vary  with  x, : 


V  =  v(x,  ;) 
r  -  r(x,z;x,) 

Now  it  will  be  easy  to  achieve  essentially  zero  residual  value  of  the  functional 
defined  in  4,  but  at  the  cost  of  losing  control  completely  over  v.  Also  the 
estimates  of  r  will  actually  depend  on  x^,  which  is  nonsense  since  r  is  supposed 
to  be  part  of  an  earth  model.  The  second  step  recovers  control  over  v,  in  a 
calibrated  way,  by  penalizing  the  dependence  of  v  on  x^.  For  reasons  discussed 
in  a  general  way  in  the  introduction,  and  justified  mathematically  in  Symes 
1991c  [35]  and  1992  [36],  we  choose  to  add  differential  penalty  term: 

^£)s[t',»’]=  ^{\\F[v,r]-  fdatalf  +  a'^\\dr/dx,f}  (5) 

This  is  the  differential  semblance  objective  function.  Making  the  first  term  small 
forces  some  level  of  fit-to-data,  independently  for  each  source  location.  Making 
the  second  term  small  forces  the  x^-dependent  reflectivities  to  resemble  each 
other.  The  two  goals  can  be  achieved  simultaneously  only  when  the  velocity  is 
kinematically  correct.  The  weight  parameter  <7  controls  the  amount  of  penalty 
applied  for  incoherence  of  )•,  and  thus  interpolates  between  totally  independent 
data  fitting  (  4  for  r  =  r{x,  z;  x,),  equivalent  to  5  for  cr  =  0)  and  the  ordinary 
least-squares  principle  (  4  for  r  =  r(x,  z),  equivalent  to  5  for  oo). 

Now  Jds  is  no  more  smooth  or  convex  than  is  Jls,  since  it  essentially 
contains  J^s  as  its  first  term.  However,  as  argued  in  the  introduction  Jos  can 
still  be  minimized  effectively  via  local  optimization  algorithms,  provided  that 
the  optimization  is  divided  into  two  stages: 

1.  first  carry  out  the  (quadratic)  minimization  over  r,  as  a  result  of  which  r 
becomes  a  function  of  v:  r  -  r[t;]  (“the  inverted  reflectivity”); 

2.  then  minimize  the  residual  J£>s[u,  j-[y]]  of  step  1  over  v. 
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Two  elements  of  this  scheme  cooperate  to  make  it  effective.  First,  only 
inverted  reflectivities  from  neighboring  source  locations  are  compared.  These 
are  similar  even  when  the  velocity  field  v  is  substantially  wrong.  Numerical 
implementation  replaces  the  partial  derivative  by  a  divided  difference,  of  course: 

^1'  >•(-,  -.a;,  +  Aa;,)  -  r(-,  ■,x,) 

dx.  Ax, 

So  long  as  the  sample  rate  Ax,  is  small  enough  relative  to  the  dominant  spatial 
wavelength  and  the  maximum  traveltime  error,  this  difference  will  not  involve 
any  cycle-skipping. 

Second,  inverted  reflectivities  are  compared.  These  really  should  all  equal 
the  target  (x, -independent)  reflectivity  if  the  velocity  is  chosen  correctly.  In 
particular  the  r[D](-,  •,  should  largely  be  free  of  amplitude  artifacts,  apart 
from  those  unavoidable  errors  caused  by  limited  data  aperture.  Therefore  a 
simple  divided  difference,  as  above,  should  be  sufficiently  robust  to  act  as  a 
velocity  indicator. 

Apart  from  these  intuitive  arguments,  we  have  given  some  mathematical 
analysis  in  Symes  1991c  [35]  and  1992  [36]  which  shows  that  J£is[vtN]  is 
smooth  and  convex  in  a  large  region  of  velocity  models  v,  for  proper  choice 
of  the  parameter  <t.  Also,  the  numerical  evidence  is  strongly  in  favor  of  this 
conclusion,  as  will  be  reviewed  below. 

Now  suppose  that  the  data  Fdata  is  consistent:  for  some  model  [v*,r*], 

Tdata  =  T[t;*,  ?•*]  (6) 

Here  r*  is  supposed  to  be  a;j,-independent.  Then 


■/tsb*,***]  =  0  =  Josb*,?’*] 

so  that  both  objective  functions  share  the  same  global  minimum.  In  other  words, 
global  minimization  of  Ji^s  is  accomplished  by  global  minimization  of  Jos  in 
the  consistent  (noise-free)  data  case.  Thus  one  can  view  differential  semblance 
optimization  as  an  infeasible  point  method  for  least  squares  inversion. 

We  also  have  some  mathematical  and  considerable  numerical  evidence  that 
the  minimization  of  Jos  is  stable  in  a  suitable  sense.  That  is,  suppose  that  6 
fails  but  the  two  sides  are  close  in  the  mean  square  sense  (i.e.  the  RMS  data 
noise  is  small).  Then  the  minimizer  v  of  Jos[t',  '"[u]]  is  kinematically  close  to  v* , 
in  the  sense  that  these  generate  RMS-close  travel  time  fields  from  all  source  and 
receiver  positions.  This  conclusion  appears  to  continue  to  hold  even  for  quite 
large  amounts  of  incoherent  data  noise. 
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Algorithmic  details 


Evidently  the  two-stage  procedure  for  minimizing  Jos  outlined  in  the  last  sec¬ 
tion  really  requires  an  algorithm  for  minimizing  the  function  of  velocity  alone: 

J[v]  =  Josb,  r[n]] 


where 

r[t)]  =  argmin,.JDs[v,r\. 

The  differential  semblance  objective  function  Jos  is  defined  by 

-  FdataW^  +  (^‘^Wdr/dxsW^  -b  A^||fEr||^} 

Here  we  have  added  a  regularizing  term  with  weiglit  A  and  operator  W,  to 
ensure  that  the  inner  minimization  over  r  is  well-behaved.  Naively  one  could 
take  W  =  I,  and  this  works  to  some  extent.  A  better  choice  is  IT  =  ^x,z,x.,  as 
discussed  below. 

We  shall  use  the  nonlinear  conjugate  gradient  algorithm  as  presented  in 
Fletcher  1980  [12]  to  minimize  J.  This  algorithm  produces  a  sequence  Vi,i  = 
0, 1,2, ...  according  to  the  rules 

1.  Initialization: 

Vi  —  S'o  =  0, /?o  —  0 

2.  for  =  1  until  convergence  do: 


Sk  =  -gradJ[vk]  +  0k-iSk-i 


Vk  +  l  =Vk+  OlkSk 


The  step  update  parameter  /?*,  is  an  algebraic  combination  of  inner  products  of 
the  gradients  gradJ\vk-i]  and  gradJ\vk\  (Polak-Ribiere  formula).  The  step  at  is 
chosen  by  a  line  search,  which  is  a  safeguarded  version  of  the  secant  minimization 
method  for  functions  of  a  single  variable.  It  involves  algebraic  combinations  and 
tests  using  the  function  value  /[vfl-ast]  and  the  slope  <  Sk,gradJ[vk  +  ask]  >v 
Here  <  •,  •  >„  is  the  symbol  we  shall  use  for  the  inner  product  in  the  Hilbert 
space  of  velocity  changes.  The  same  inner  product  is  used  in  the  Polak-Ribiere 
formula.  Convergence  is  tested  by  decrease  of  the  gradient  norm: 

convergence  ||yradJ[t;t]||„  <  e|isra(iJ[?;i„i„a(|k 

Here  ||  •  ||t,  is  the  norm  associated  with  the  inner  product  <  •,  •>„. 

To  carry  out  this  program,  we  must  settle  on  methods  for  computing 
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1.  the  inner  product  <  >„; 

2.  the  value 

3.  the  gradient  gradJ[v\. 

The  subsections  to  follow  describe  each  of  these  calculations. 


The  Hilbert  space  structure  of  velocity  updates 


The  theory  underlying  all  of  the  velocity  analysis  methods  mentioned  in  the  in¬ 
troduction,  including  difl'erential  semblance,  depends  on  the  separation  of  scales 
between  velocity  in  the  one  hand  and  reflectivity  on  the  other.  In  many  cases 
the  large-scale  nature  of  velocity  has  been  enforced  through  parameterization, 
by  selecting  velocities  from  a  fixed  low-dimensional  space  of  smooth  functions. 
Obviously  such  a  constraint  is  rather  arbitrary,  and  it  is  far  from  obvious  that  a 
choice  of  parameterization  could  be  made  which  yields  large-scale  structure  and 
also  reproduces  travel  times  with  the  necessary  fidelity  in  every  situation.  If 
the  initial  velocity  estimate  u,ni(ta(  is  smooth  (for  instance  constant  or  constant 
gradient),  then  control  of  the  length  scales  in  the  space  of  velocity  updates  is 
the  important  issue.  We  prefer  to  allow  the  velocity  updates  6v  the  same  formal 
degrees  of  freedom  as  the  other  grid  functions,  and  enforce  only  smoothness 
through  the  choice  of  norm  or  inner  product.  In  this  approach  the  Hilbert  space 
structure  of  velocity  updates  becomes  extremely  important. 

We  use  a  discrete  analogue  of  the  Sobolev  scale  of  Hilbert  norms.  These  are 
based  on  the  anisotropic  discrete  Laplace  operator 


6v. 


-b  w~  - 


_ 


A22 


=  {L6v)ij 


With  suitable  boundary  conditions  this  operator  is  diagonalized  by  the  Fourier 
transform,  so  it  is  possible  to  compute  arbitrary  powers  of  L  very  efficiently. 
The  sth  Sobolev  norm  is  given  by 


where  the  norm  without  the  subscripts  is  the  ordinary  norm 


||nlP  =  Aj;A2^|wijP 


and 

A  =  (/-  L)5 


The  operator  A  multiplies  the  {k,l)  discrete  Fourier  coefficient  of  6v  by  (1  -f 
w'^k^Ax^  -f  w1l~Az~)^  for  small  wavenumbers  kAx,lAz.  Thus  the  power  s 


12 


determines  the  rate  at  which  high  frequencies  are  emphasized,  and  the  weights 
Wx,w^  determine  the  corner  frequencies  in  each  direction. 

For  reasons  to  be  explained  below,  we  constrain  the  velocity  updates  to 
vanish  in  a  surface  layer  containing  the  source: 

6v{x,  z)  =  0  for  0  <  2  <  z,„in 

The  boundary  conditions  imposed  on  L  in  the  definition  of  A  should  include 
u  =  0  at  the  grid  level  just  below  Zmin-  The  other  three  boundary  conditions 
are  less  important. 

Besides  defining  the  inner  products  in  the  conjugate  gradient  formulae,  the 
operator  A*  plays  a  critical  role  in  the  definition  of  the  adjoint  operators,  and 
thereby  on  the  gradient.  Indeed,  suppo.se  that  A  is  some  other  operator  taking 
velocity  updates  Sv  to  some  other  sort  of  grid  functions.  We  use  the  Sobolev 
s  norm  on  the  space  of  velocity  updates,  and  the  ordinary  (L^)  norm  on  the 
other  space  of  grid  functions.  We  assume  that  the  ordinary  (L~)  transpose  or 
conjugate  operator  A'^  is  available.  That  is,  A^  is  the  operator  which  would 
be  computed  by  transposing  the  matrix  of  A,  if  one  could  access  the  matrix 
elements  of  A  (out  of  the  question  in  almost  all  examples  of  interest  to  us!).  It 
is  defined  by  the  condition 


<  A6v,<j)  >=<  8v,A^(j)  > 

for  all  suitable  grid  functions  6v,(f).  On  the  other  hand  the  adjoint  for  the 
Sobolev  s  norm  on  Sv  is  defined  by  the  similar  formula 

<  ASv,(^  >=<  Sv,A*<p  >,=<  A"6v,A^A*<t>  >=<  6v,A'^''A*(I)  > 

Inspecting  these  two  formulas  we  see  that 

=  A^"A* 


Since  A  is  inverse  to  A^’,  this  is  equivalent  to 

A*  =  A-2m^  (7) 

From  the  remark  above  about  the  Fourier  coefficients  we  see  that  if  s  >  0,  then 
A~^’  suppresses  high  frequencies,  i.e.  smooths  its  argument.  Thus 

the  ordinary  transpose  of  an  operator  is  converted  to  its  adjoint  in  the 
sense  of  the  Sobolev  s-norm  by  application  of  a  smoothing  operator. 

This  observation  will  play  a  crucial  role  in  explaining  the  construction  of  the 
velocity  gradient  below. 


13 


Evaluation  of  J[t;] 

Evaluation  of  J  requires  solution  of  the  quadratic  optimization  problem 

minimizer  J D s\v ,  »■] 

to  produce  as  well  as  the  value  J[v]  =  Josb, '’b]]- 
Recall  that 

Jds[v,  r]  =  -  F^ataW^  +  u-^1|3j73x,|P  +  A^IIVErll^} 

The  variation  or  directional  derivative  of  Jps  in  the  direction  Sr  is 

=  <  DrF[v,r]Sr,  F[v,r]- Fdata  > 

+  cr-  <  d6r/dxs,dr/dx,  > 

+  <  WSr,  Wr  > 

Here  <  •,  •  >  is  the  norm  over  x,z,  and  x,  (discretely  the  Euclidean  dot 
product  weighted  by  AaiAzAa;,).  Because  of  the  convention  mentioned  in  the 
last  section,  that  the  boundaries  of  the  computational  domain  R  are  either 
pressure  free  or  are  so  far  away  as  not  to  matter,  we  can  assume  that  r  and  Sr 
vanish  near  the  boundary  of  R  without  loss  of  generality.  Integration  (discretely, 
summation)  and  definition  of  the  adjoints  gives 

^Jds[v,  r]  =<  Sr,  DrF[v,  r]*(E[u,  ?•]  -  Fdata)  -  cr^d-rjdx^  +  X^W^Wr  > 

from  which  we  read  off  immediately 

gradrJDs[v,r]  =  Hrf  [n, r]*(f  [u,  7-]  -  Fdata)  -  cr-d’^r/dx]  +  X'W'^Wr 

Setting  the  gradient  to  zero  gives  the  normal  equation.  Because  F  is  linear  in 
r, 

DrF[v,  r\Sr  —  F[v,  ^r] 

whence  the  adjoint  operator 


M[v]  =  DrF[v,rY 

is  independent  of  r  and  the  normal  equation  is  linear  in  r.  With  proper  choice 
of  regularization,  this  positive  definite  symmetric  linear  system  is  necessary  and 
sufficient  for  optimality  of  r.  It  can  be  re-written  as 

Af[u]r  M[v]Fdata  (8) 

where 

A[r]7-  =  M[v]F[v,  r]  —  cP’d’^rjdxl  -f  X'^W^Wr 
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is  the  normal  operator. 

Simply  to  evaluate  the  both  sides  of  the  normal  equations  requires  algo¬ 
rithms  for  the  action  of  TV’fn]  and  Our  approach  to  solving  the  normal 

equations  will  also  make  repeated  use  of  such  algorithms.  The  normal  opera¬ 
tor  N[v]  is  composed  of  M[v\,  F[d,  r],  d/dx,,  and  the  regularizing  operator  W, 
which  is  either  a  scalar  operator  or  made  up  out  of  derivatives  in  our  current 
implementation.  The  derivatives  are  approximated  by  finite  difference  opera¬ 
tors  in  obvious  ways,  and  we  described  the  calculation  of  T[i;,r]  in  the  last 
section.  Thus  it  only  remains  to  specify  an  algorithm  for  the  action  of  the  ad¬ 
joint  operator  M[v]  on  a  field  of  “residuals”  <f>(Xr,t;Xg).  This  algorithm  is  the 
so-called  adjoint  state  method,  and  has  been  explained  a  number  of  times  in 
the  literature;  see  especially  Lailly  1983  [18],  Tarantola  1984  [39].  In  brief, 

1.  backpropagate  the  residuals,  i.e.  solve  the  final  value  problem  for  each  x,: 

1  d'^w 

-  ^^w){x,z,t;x,)  =  '^<j){Xr,t;Xg)6(x  -  Xr,Z-  Zr)  (9) 

•Pr 

W  :zz  O^i  ^  imax 

2.  crosscorrelate  the  backpropagated  field  w  with  the  incident  field  pq: 

M[v](l)  —  ~~  J  diwV'^po 

The  field  w  satisfies  the  same  (homogeneous)  boundary  conditions  as  do  po  and 
p.  Note  that  the  final  values  of  po  can  be  saved,  and  the  field  reconstructed 
backwards  in  time,  since  the  wave  equation  is  time-reversible.  Thus  the  output 
of  the  crosscorrelation  can  be  accumulated  as  the  calculation  proceeds,  with 
tremendous  savings  in  storage.  Note  that  the  output  M[v](}>  in  this  context 
is  ar^-dependent.  In  fact,  M[v]  is  exactly  a  prestack  migration  operator.  The 
analogous  operator  for  ordinary  least-squares  inversion  is  computed  in  exactly 
the  same  way,  but  includes  a  final  summation  (stack)  over  Xg. 

Because  of  the  size  of  the  normal  equation  8,  it  is  most  conveniently  solved 
by  a  polynomial  iteration.  While  the  conjugate  gradient  family  is  favored  in 
most  similar  work,  and  is  optimal  in  a  sense  we  will  not  make  precise,  we  prefer 
to  use  Chebyshev  iteration  for  reasons  that  will  become  clear  later. 

Chebyshev  iteration  involves  four  distinct  steps: 

1.  choosing  an  inversion  level', 

2.  estimating  the  spectral  bound  and  scaling  the  normal  operator; 

3.  computing  the  Chebyshev  coefficients  and  estimating  the  necessary  num¬ 
ber  of  iterations; 
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4.  application  of  the  Chebyshev  polynomial  by  recursive  application  of  the 
normal  operator. 

A  good  reference  on  clcissical  Chebyshev  iteration  as  applied  to  discretized  el¬ 
liptic  boundary  value  problems  is  Varga  1962  [43].  Stork  and  Clayton  1991  and 
1992  [31],  [32]  reported  the  use  of  Chebyshev  accelerated  Richardson  iteration 
applied  to  reflection  tomography.  They  pointed  out  the  the  advantages  of  the 
control  provided  by  Chebyshev  iteration  over  the  portion  of  the  spectrum  in¬ 
verted.  Their  algorithm  is  somewhat  different  than  ours.  In  particular  they  are 
able  to  normalize  their  operator  so  as  to  be  able  to  apply  Chebyshev  acceler¬ 
ation  directly,  whereas  we  must  compute  an  approximate  spectral  bound.  We 
were  unable  to  locate  a  description  of  the  modifications  necessary  to  handle  the 
present  problem,  so  we  will  give  a  fairly  detailed  description  of  the  algorithm. 


Choosing  the  inversion  level 

Chebyshev  iteration  reduces  the  error  and  residual  components  of  a  symmetric 
linear  system  to  a  specified  level  relative  to  their  original  values  over  a  specified 
part  of  the  operator  spectrum.  For  positive  definite  problems  like  the  normal 
equations  it  is  natural  to  specify  a  lowest  eigenvalue  at  which  the  reduction  will 
be  achieved,  and  to  give  its  value  relative  to  the  largest  eigenvalue  (spectral 
bound)  Xmax  of  iV[v].  Therefore  it  is  necessary  to  choose  the  error  reduction 
factor  €  and  the  inversion  level  y.  The  iteration  will  be  designed  so  that  if 
7 Xmax  <  <  Xmax  and  X  is  an  eigenvalue  of  Af[r;]  then  the  corresponding 

component  of  the  normal  residual  M[v]F4aia  —  A^[i’]j’  will  be  reduced  by  the 
factor  e  (at  least).  Evidently  a  good  choice  of  inversion  level  depends  on  the 
spectral  structure  of  the  data,  as  does  the  choice  of  error  reduction  factor.  Stork 
and  Clayton  1991,  1992  [31],  [32]  suggest  that  in  tomographic  inversion  y  should 
be  chosen  quite  small,  a  typical  figure  being  0.04. 


Estimating  the  spectx'al  bound  and  scaling  the  normal  operator 

Of  course  the  spectral  bound  is  itself  unknown  a  priori.  It  represents  the  natural 
scale  of  the  problem,  so  is  valuable  information  in  its  own  right.  Fortunately  un¬ 
derestimates  are  a  by-product  of  the  iteration,  in  the  form  of  Rayleigh  quotients 
of  the  reflectivity  updates  Srk,  at  the  cost  of  a  couple  of  extra  inner  products: 

N[v]6rk  > 

k  —  ^  c  c  's^ 

<  ork,ork  > 

Any  number  strictly  larger  than  the  maximum  of  these  estimates  can  be  used 
in  the  place  of  the  actual  spectral  bound  Xmax-  The  additional  steps  needed  to 
make  use  of  these  adaptive  spectral  estimates  are  as  follows: 


16 


1.  choose  a  “fudge  factor”  a  >  1.0; 

2.  replace  the  error  reduction  factor  e  by 

_  \/a  —  le 

^est  —  ~  = 

1  +  \/a  —  1 

3.  initialize  the  spectral  bound  Xgst  by 

^est  —  aRQo 

or  by 

Xesi  =  max(Xesi,aRQo) 

if  a  previous  estimate  is  available. 

4.  if  at  step  k,  RQk  >  replace  Ae.,i  by  aRQk,  recompute  Sest  and  all 
quantities  dependent  on  it,  and  restart  the  iteration. 

It  is  simple  to  show  that  the  adaptive  iteration,  thus  safeguarded,  produces  final 
error  reduction  of  at  least  e. 

In  practice  we  have  found  that  the  spectral  bound  estimated  in  this  way 
is  quite  stable.  Initially,  one  or  two  (or  rarely  three)  restarts  are  necessary. 
Subsequent  similar  inner  inversions  with  updated  velocities,  as  occur  in  the 
outer  inversion  loop,  can  usually  proceed  with  a  previous  A^jf. 

Since  Chebyshev  iteration  in  its  original  form  applies  to  operators  with  spec¬ 
tra  lying  between  —1  and  1,  it  is  in  principle  necessary  to  scale  and  translate 
A^[v],  i.e.  to  replace  it  by  /  —  sA^[u]  for  a  suitable  scale  factor  s,  to  satisfy  this 
condition.  We  have  re-written  the  algorithm  so  that  this  step  is  carried  out 
implicitly;  nontheless  the  scale  factor  s  is  needed.  An  optimal  choice  is 

2 

"  (1  +  7)A 

where  in  practice  A„jar  is  replaced  by  Xg,t. 


Computing  the  Chebyshev  coefficients  and  estimating  the  necessary 
number  of  iterations 

The  necessary  coefficients  are  generated  by  the  following  recursions,  in  which 
the  quantity 
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figures  prominently  (see  eg.  Varga  1962  [43]): 


Co  = 

Cl  = 

Cfc+1  = 
Wo  = 
wi  = 
Wfc  +  l  = 


1 

j[ 

1 


^c*  -  Ck-\,k  =1,2, ... 

0 


1  +  =  1,2,... 

Cfc  +  l 


The  error  reduction  factor  after  k  steps  is 

~  1  +  9^^ 

where 

9.  _ 

1-  v/T^ 

The  simplest  way  to  ensure  the  required  decrease  is  to  compute  these  numbers 
until 

which  give  the  necessary  number  of  iterations  ifcmar- 


Application  of  the  Chebyshev  polynomial  by  recursive  application  of 
the  normal  operator 

The  usual  arrangement  of  Chebyshev  iteration  is  a.s  a  three-term  recursion  - 
see  Varga  1962  [43]  for  instance.  We  prefer  to  arrange  the  algorithm  so  that  it 
resembles  the  usual  form  of  conjugate  gradient  iteration,  as  follows: 

1.  initialize: 

0 
0 

lV[r]eo 

<  eo ,  nro  > 

<  Cq,  Co  > 


Co  = 
Si’o  = 
Co  = 
nro  = 

RQo  = 
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2.  for  =  0  :  kmax  -  1 

(ujk+i  -  l)Srk  +  SLJk+iCk 
rk  +  Svk+i 
N[v]6rk+i 

ejt  -  n7-i-+i 

<  ^rk+uni'k+i  > 

<  8rk+u6rk+i  > 

The  final  outputs  of  this  process  are  the  estimated  reflectivity, 

and  the  estimated  normal  residual 

-  M[v]Fdaia  ~  N[v]r[v]e:t, 

We  then  compute  the  seismogram  misfit 

=  F[v,  r[t;]e,(]  -  Fdata 
and  finally  the  estimated  value  of  J: 

=  +  \\dr[v]^,tldx,\\-  +  ||Wr[7;]e„||^) 

Computation  of  the  velocity  gradient  gradJ[v] 

For  a  perturbation  8v  of  the  velocity  we  obtain  the  directional  derivative 

DJ[v\8v  =  +  (A-iDs)[t',  7'[v]]Z?„7-[w]57;  (10) 

We  will  first  assume  that  we  have  actually  solved  the  normal  equation  8  derived 
in  the  preceding  pages,  and  produce  in  this  way  a  naive  gradient  approximation. 
Since  equation  8  states  exactly  that  the  7--derivative  of  Jjjs  vanishes  at  [ti,  7’[7;]], 
the  second  term  in  equation  10  vanishes.  Since  the  differential  semblance  and 
regularization  terms  are  independent  of  v,  we  obtain 

DJ[v]8v  =  (Dt,  j£)s)[t7,7-[t)]]6t;  =<  D^F[v,r[v]]8v,  F[v,r[v]]  -  Fdata  > 

Evidently  we  must  construct  a  bilinear  operator  B  so  that  for  arbitrary  data-like 
<8,  reflectivity-like  q,  and  velocity-like  v  and  8v, 

<  D^F[v,q]6v,(t>  >=<  6v,  B[q,(p]  >  (11) 


= 

t-a-i-i  = 
7r7-i+i  = 

Cfc  +  l  = 

RQk+\  - 
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Then 

DJ[v\8v  =  <  6v,  B[r[D],  F[v,  r[w]]  -  Fdata]  > 

=  <  6v,  A~^''B[r[t;],  F[v,  r[t;]]  -  Fdata]  >, 

recalling  the_  discussion  of  Sobolev  adjoints  earlier  in  the  section.  Thus  the 
gradient  of  J  in  the  sense  of  the  Sobolev  s-norm  is 

gradJ[v]  =  F[v,  -  Fdata]  (12) 

As  noted  before,  A~^‘'  is  a  smoothing  operator  for  positive  s. 

Of  course  in  practice  we  must  replace  the  reflectivity  r[t;]  by  its  estimate 
obtained  as  explained  in  the  last  section,  and  the  misfit  Fft;,  r[v]]  —  Fdata 
by  its  estimate  These  replacements  in  12  yield  the  naive  gradient 

estimate: 

gradJ[v]est  =  m[v]es,]  (13) 

To  complete  the  description  of  the  naive  gradient  estimate,  it  only  remains  to 
describe  the  calculation  of  B.  This  calculation  is  an  extension  of  the  adjoint  state 
technique,  and  is  derived  in  Appendix  A.  It  is  very  similar  to  the  computation 
of  gradients  for  functions  of  migrated  images  proposed  by  Chavent  and  Jacewitz 
1990  [6].  The  steps  are: 

1.  compute  the  backpropagated  residual  field  lu,  i.e.  solve  the  problem  9  as 
in  the  calculation  of  M[v]\ 

2.  compute  the  secondary  backpropagated  field  wq  satisfying 

1 

~  '^'^o)(^-,-!:,t;Xs)  =  2V^(qw) 

Wq  “  0,/  ^  tinax 

with  homogeneous  boundary  conditions;  Wq  can  be  stepped  backwards  in 
time  along  with  w  and  the  reconstructions  of  po>  P  in  the  finite  difference 
scheme; 

3.  compute  the  double  crosscorrelation 

B[q,4>]  ~  ~  J  dt{w{V-p  +  2qV"^po)  +  wqV'^po) 

Thus  the  calculation  of  the  naive  gradient  approximation  is  a  reasonably 
straightforward  extension  of  the  usual  gradient  calculation  from  least-squares 
inversion.  Unfortunately  the  naive  gradient  13  is  not  very  accurate  as  an  ap¬ 
proximation  to  the  exact  gradient  12.  We  will  illustrate  this  in  the  next  section. 
The  inaccuracy  can  be  large  enough  to  render  13  useless  for  optimization  of  J ! 
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In  fact,  we  have  neglected  the  second  term  in  the  equation  10.  This  can  be 
rather  large,  even  when  the  normal  equation  is  solved  rather  precisely,  because 
it  involves  the  derivative  of  the  reflectivity  estimate  with  respect  to  velocity.  It  is 
the  anomalously  large  size  of  the  derivative  of  the  synthetic  data  with  respect  to 
velocity  that  led  to  nonconvexity  of  the  mean  square  error  function.  For  exactly 
the  same  reason  -  phase  perturbation  of  rapidly  oscillating  signals  -  neglect  of 
the  second  term  in  10  leads  to  large  errors. 

Our  resolution  of  this  problem  involves  computing  the  second  term  in  10 
-  not  all  of  it,  which  would  be  prohibitively  e.xpensive,  but  its  principal  part, 
which  turns  out  to  be  available  at  reasonable  cost.  The  calculation  is  tied  to 
our  method  of  producing  namely  Chebyshev  iteration,  and  is  derived  in 

Appendix  B: 

1.  Carry  out  the  Chebyshev  iteration  as  above  to  obtain  estimates  of  reflec¬ 
tivity  7‘[u]e,t,  normal  residual  e[v]est,  and  seismogram  misfit 

2.  Rerun  Chebyshev  iteration  with  the  norma!  residual  as  data,  to  produce 
the  auxiliary  field  y: 

2/0  =  0 
hyo  =  0 
fo  —  o[u]e5( 

mjo  =  AH/o 

For  —  0  .  hrnax  1 

6yk+i  =  (wjt+i  -  \)6yk  +  swk+\fk 
Vk+i  =  yk-\-hyk+\ 
nyt+i  =  N[v]6yk-ki 
fk+i  =  fk  -  nyk+i 
The  set  2;[o]eS/  =  yk„,„- 

3.  Compute  the  synthetic  of  the  auxiliary  field; 

4.  Replace  the  naive  gradient  formula  13  by 

gradJ[v]t„t  =  -  yHe,t,  m[u]e,t]  -  R[r[?;]„(,  sr[t;]e,t])  (14) 

Because  of  the  second  Chebyshev  iteration  and  the  extra  calculations  of  the 
forward  map  F  and  the  bilinear  operator  B,  the  corrected  gradient  14  is  roughly 
twice  as  expensive  as  the  naive  gradient  13.  Despite  the  added  expense,  we 
advocate  its  use  because 
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•  we  are  able  to  give  a  mathematical  proof  that  the  corrected  gradient  error, 
i.e.  the  difference  between  12  and  14,  tends  to  zero  as  kmax  — *■  oo; 

•  in  many  numerical  examples,  some  of  which  are  reviewed  below,  the  cor¬ 
rected  gradient  has  proven  /or  more  accurate  than  the  naive  gradient,  for 
modest  kmax-  In  fact  usually  the  corrected  gradient  with  kmax  iterations  is 
a  great  deal  more  accurate  than  the  naive  gradient  with  2kmax  iterations, 
so  the  correction  also  appears  to  be  efficient. 

We  are  not  so  far  able  to  prove  gradient  convergence  for  any  correction 
scheme  if  Chebyshev  iteration  is  replaced  by  conjugate  gradients.  Also  a  limited 
amount  of  numerical  evidence  suggests  that  conjugate  gradient  derived  gradients 
are  really  not  corrected  by  these  tricks.  Apparently  the  constant  polynomial 
coefficients  of  Chebyshev  iteration  actually  make  the  estimated  J  smoother.  In 
any  case  this  partly  explains  our  preference  for  Chebyshev  iteration,  for  this 
application. 

We  have  now  explained  how  all  pieces  of  the  nonlinear  conjugate  gradient 
iteration  are  calculated.  We  proceed  to  examine  some  numerical  results-. 
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Numerical  investigation 


We  present  several  numerical  examples  of  differential  semblance  optimization 
applied  to  synthetic  data  sets.  The  examples  fall  into  two  groups.  The  first 
group  concerns  estimation  of  a  layered  model.  We  make  use  of  the  observa¬ 
tion  that  the  translational  symmetry  of  layered  data  gives  full  information  on 
moveout  from  only  two  neighboring  shot  records.  The  use  of  a  two  shot  “line” 
minimizes  the  expense  of  each  trial,  and  allowed  us  to  perform  much  more  exper¬ 
imentation  than  would  have  been  possible  with  larger  simulations.  The  second 
group  of  examples  tests  the  ability  of  differential  semblance  to  detect  strong 
lateral  heterogeneity  by  exhibiting  gradients  of  the  objective  j. 

All  examples  presented  here  share  the  same  source  and  receiver  parameters. 
The  “cable”  consisted  of  34  point  receivers  spaced  50  m  apart,  12  m  beneath  the 
free  surface.  Near  offset  is  150  m,  far  offset  1800  m.  The  source  is  also  punctual 
and  isotropic,  and  is  located  at  8  m  depth.  Shot  spacing  was  96  m.  The  source 
waveform  was  a  Ricker  wavelet  with  15  Hz  peak  frequency.  The  sample  rate 
was  4  ms,  and  2  s  of  data  were  used.  A  mute  was  applied  to  remove  direct  and 
postcritical  energy. 

The  finite  difference  grid  interval  was  16  m  in  both  directions.  For  the 
velocity  ranges  used,  this  choice  gave  roughly  5  gridpoints  per  wavelength  at 
minimum  velocity  and  peak  frequency.  Some  evidence  of  numerical  dispersion 
from  the  (4,2)  finite  difference  scheme  is  visible  in  the  synthetics.  However 
numerical  dispersion  did  not  appear  to  be  .so  severe  as  to  change  the  character 
of  the  results. 

Layered  model 

The  target  layered  model  appears  in  Figure  2,  and  the  two-shot  data  set  in 
Figure  3.  Note  that  the  water  bottom  event  is  mostly  removed  from  the  data 
by  the  (very  crude  linear)  mute,  and  so  the  zone  from  the  surface  to  roughly 
650  m  is  essentially  quiet. 

An  effect  entirely  peculiar  to  2  shot  data  sets  and  layered  media  is  that  the 
actual  value  of  the  differential  semblance  weight  cr  becomes  essentially  irrelevant. 
Figure  4  shows  the  curves  generated  by  sampling  the  objective  function  J  along 
the  line  from  the  constant  velocity  profile  (v  =  1.5  m/ms)  and  the  target  profile, 
for  several  values  of  a.  All  curves  are  es.sentially  quadratic.  In  the  experiments 
we  used  cr  =  10.0.  We  also  used  a  very  small  amount  of  damping  (0.1  %  of 
the  maximum  eigenvalue  estimate)  to  enhance  convergence  of  the  Chebyshev 
iteration  for  reflectivity. 

We  emphasize  that  the  eventual  independence  from  a  of  the  objective  func¬ 
tion  is  characteristic  only  of  two-shot  data  sets  and  layered  media.  In  effect. 
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for  large  a  the  two  reflectivity  gathers  are  forced  to  be  the  same,  and  the  DSO 
objective  is  essentially  identical  to  the  output  least  squares  objective  after  min¬ 
imization  over  reflectivity.  Since  the  two  source  positions  are  close,  the  inverted 
reflectivity  from  one  data  gather  would  yield  a  slowly  varying  error  in  predicting 
the  other  data  gather,  even  when  the  velocity  is  dramatically  wrong.  Therefore 
the  same  is  true  of  the  reflectivity  optimized  for  both  data  gathers  simultane¬ 
ously.  Otherwise  put,  since  the  source  positions  are  close,  no  “cycle  skipping” 
occurs:  for  this  configuration,  output  least  squares  is  effective,  provided  that  op¬ 
timization  proceeds  in  the  two-stage  fashion  we  have  devised  for  DSO!  As  soon 
as  more  distance  between  shot  points  is  involved,  £is  is  the  case  when  the  num¬ 
ber  of  shots  becomes  significantly  larger,  the  differential  semblance  objective 
depends  quite  strongly  on  a,  e.xhibiting  convexity  for  small  a  and  approaching 
the  (quite  nonconvex)  output  least  squares  objective  for  large  a. 

For  the  inner  products  in  the  conjugate  gradient  formulas,  and  to  smooth  the 
gradient  estimate  as  noted  before,  we  used  the  Sobolev  2-norm  on  the  space  of 
velocity  updates.  We  set  the  anisotropic  weight  parameters  in  the  Laplacian  so 
that  the  velocity  gradient  calculation  would  suppress  Fourier  components  with 
wavelengths  of  less  than  400  m  to  5%  or  less  of  their  size  in  the  unsmoothed 
gradient. 

For  this  set  of  experiments  we  defined  the  Chebyshev  polynomial  (used  in  the 
inner  optimization  for  reflectivity)  to  suppress  error  components  in  the  normal 
residual  above  2%  of  the  (estimated)  maximum  eigenvalue  to  at  most  2%  of 
their  initial  values.  This  clioice  implied  19  inner  iterations.  Parameters  for 
the  (outer)  nonlinear  conjugate  gradient  iteration  over  velocity  were  eis  advised 
in  Fletcher  1980  [12],  In  all  cases  we  used  the  corrected  gradient  construction 
explained  in  the  last  section. 

The  result  of  four  nonlinear  conjugate  gradient  steps  is  displayed  in  Figure 
5.  While  the  general  trend  of  the  velocity  to  about  1200  m  is  correct,  the  details 
are  not  reproduced  at  all.  Nonetheless  the  estimated  velocity  is  kinematically 
correct.  This  is  evidenced  by  the  comparison  of  the  inverted  reflectivity  (for 
shot  1  -  shot  2  is  similar)  shown  in  Figure  6a  with  the  inverted  reflectivity  at 
the  correct  velocity  (with  the  same  iteration  parameters)  in  Figure  6b.  The 
two  are  essentially  equally  “fiat”,  whereas  the  inverted  reflectivity  at  constant 
velocity  (Figure  6c)  exhibits  considerable  residual  moveout.  Because  of  the 
translational  invariance  of  data  from  layered  models,  an  individual  reflectivity 
gather  is  equivalent  to  the  (inversion  analogue  of)  a  coherency  or  common  image 
panel,  as  used  in  migration  velocity  analysis  (eg.  Versteeg  and  Grau  1991  [45]). 
Thus  its  flatness  is  diagnostic  of  kinematical  correctness  of  the  velocity  model. 

The  RMS  data  misfit  at  the  end  of  this  exercise  was  roughly  22%.  The 
predicted  data  is  displayed  in  Figure  7,  the  misfit  in  Figure  8.  Evidently  all 
major  features  are  reproduced.  While  some  error  energy  remains  in  the  major 
reflector  at  1.8  s,  much  of  the  remainder  appears  to  be  finite  difference  noise. 
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To  see  the  improvement  possible  through  furtlier  iteration,  we  performed  four 
more  nonlinear  conjugate  gradient  steps.  The  results  are  displayed  in  Figures 
9  and  10.  Evidently  the  velocity  structure  is  somewhat  improved,  though  the 
kinematic  effects  of  this  improvement  are  miniscule. 

Table  1  shows  the  performance  of  the  nonlinear  conjugate  gradient  iteration. 

To  explore  a  little  bit  the  effect  of  one  of  the  parameter  choices,  we  show  in 
Figure  11  the  velocity  estimates  produced  by  choosing  length  scales  of  300  m,  400 
m,  and  500  m  respectively  in  the  velocity  gradient  calculation.  These  estimates 
resulted  from  four  nonlinear  conjugate  gradient  steps  with  other  parameters  set 
as  before.  The  300  m  result  shows  considerable  detail,  all  of  which  is  wrong. 
The  reflectivities  (Figure  12)  are  essentially  equally  flat.  The  500  m  result  shows 
slightly  better  depthing  of  reflectors.  These  e.xperiments  suggest  strongly  that 
only  the  kinematic  aspects  of  velocity  inversions  are  to  be  trusted,  never  the 
short-scale  details. 

Nonlayered  model 

The  second  group  of  experiments  used  two  35  shot  data  sets.  The  goal  of 
this  group  is  to  illustrate  the  ability  of  differential  semblance  to  detect  velocity 
changes.  Accordingly  we  merely  generate  gradients  rather  than  attempt  full 
inversions. 

The  first  data  set  was  generated  from  the  layered  model  used  in  the  first 
group,  so  all  shots  look  exactly  like  those  displayed  in  Figure  3.  Figure  13  shows 
a  grey  scale  plot  of  the  gradient  at  constant  velocity  v  =  1.5  m/ms.  Except  for 
edge  effects  at  the  ends  of  the  line,  this  gradient  is  very  nearly  layered.  Several 
vertical  cross  sections  appear  in  Figure  14,  illustrating  the  layered  nature  of  the 
gradient. 

We  selected  this  configuration  to  illustrate  the  importance  of  the  gradient 
correction.  We  computed  the  slope  of  J  along  the  line  connecting  v  =  1.5m/ms 
and  the  target  velocity  profile  in  two  ways; 

1.  by  the  (Sobolev  s-}  inner  product  of  gradient  with  the  difference  of  the 
velocity  profiles; 

2.  by  very  careful  extrapolated  finite  difference  estimates. 

Table  2  compares  the  results  for  two  different  levels  of  normal  residual  reduc¬ 
tion.  Evidently  the  degree  of  approximation  of  the  finite  difference  estimate  by 
the  naive  gradient  (formula  13)  barely  improves  with  the  threefold  reduction 
in  the  normal  residual,  whereas  the  correct  gradient  (formula  14)  converges 
superlinearly.  We  have  observed  this  behaviour  in  many  examples. 

To  test  the  ability  of  differential  semblance  to  detect  strong  lateral  velocity 
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changes,  we  added  to  the  model  used  in  the  preceding  examples  a  high-velocity 
near-surface  anomaly,  roughly  circular  in  shape  and  with  a  diameter  of  about 
300  m.  A  grey-scale  plot  appears  in  Figure  15.  We  used  the  same  (layered) 
reflectivity  model  as  in  the  previous  examples.  The  synthetic  data  (Figure  16) 
clearly  reveals  the  presence  of  the  anomaly  through  strongly  nonhyperbolic  and 
shot-dependent  moveout. 

We  first  computed  the  gradient  at  constant  u  =  1.5  m/ms.  A  grey-scale  plot 
appears  as  Figure  17.  The  result  is  similar  to  the  gradient  produced  from  the 
layered  data  (Figure  13).  Though  evidence  of  lateral  heterogeneity  is  visible, 
the  actual  character  of  the  anomaly  is  not.  A  conjugate  gradient  step  would 
move  the  velocity  estimate  in  roughly  the  direction  of  the  layered  model. 

We  then  computed  the  gradient  at  the  layered  model  used  in  the  preceding 
experiments.  A  grey-scale  plot  appears  as  Figure  18,  and  a  horizontal  cross- 
section  at  300  m  depth  as  Figure  19.  The  anomaly  is  now  clearly  defined.  A 
single  conjugate  gradient  step  would  add  a  negative  multiple  of  the  gradient  to 
the  layered  model,  producing  a  high-velocity  anomaly  in  exactly  the  right  place, 
of  approximately  correct  extent. 

Comment  on  parameters 

Evidently  differential  semblance  optimization  as  formulated  in  this  paper  de¬ 
pends  on  a  large  number  of  parameter  choices:  inversion  level  and  error  tol¬ 
erance  for  the  Chebyshev  loop,  length  scales  for  the  gradient  smoothing  step, 
regularization  and  differential  semblance  weights,  nonlinear  conjugate  gradient 
parameters,...  Moreover  inclusion  of  flexible  constraints  reflecting  “geological” 
knowledge,  as  in  Stork  and  Clayton  1991  [31]  for  example,  while  essential  for 
the  production  of  meaningful  results,  can  introduce  yet  more  tuning  and  penalty 
parameters.  At  least  partial  automation  of  parameter  selection  is  obviously  es¬ 
sential  to  the  eventual  success  of  differential  semblance  optimization,  or  indeed 
any  geophysical  inversion  tool. 
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Discussion:  Migration  velocity  analysis,  travel 
time  tomography,  least  squares  inversion,  and 
differential  semblance 

Differential  semblance  optimization  is  closely  related  in  conception  and  be¬ 
haviour  to  several  other  approaches  to  velocity  estimation. 

Waveform-based  velocity  estimation  methods  in  widespread  use  rely  on  sys¬ 
tematic  search  through  a  smalt  family  of  models.  A  fundamental  example  of 
this  type  is  semblance  analysis  of  normal  moveout  corrected  common  midpoint 
gathers,  as  in  Taner  and  Koehler  1969  [38].  A  number  of  recent  variants  of  this 
idea  go  under  the  name  migration  velocity  analysis,  in  which  normal  moveout 
correction  is  replaced  by  some  variety  of  prestack  migration  (see  for  example 
Deregowski  1990  [9]  or  the  volume  Versteeg  and  Grau  1991  [45]  for  many  exam¬ 
ples  and  references).  All  u.se  a  measure  of  event  coherence,  either  numerical  (eg. 
stack  power)  or  visual  (flatness  of  events  in  velocity  analysis  or  coherency  pan¬ 
els,  verticality  of  focusing  spots)  or  both.  These  approaches  are  quite  successful 
when  their  assumptions  are  justified,  yet  all  are  limited  primarily  by  the  need 
to  search  systematically  a  range  of  velocity  models.  The  need  for  systematic 
search  follows  from  the  behaviour  of  the  event  coherence  measures.  Because  of 
the  extreme  sensitivity  of  waveform  data  to  velocities,  the  numerical  measures 
are  very  nonconvex,  and  so  must  be  sampled  finely.  The  visual  measures  provide 
no  means  to  improve  the  velocity  automatically,  so  once  again  fine  sampling  is 
necessary.  This  systematic,  finely  sampled  search  perforce  must  be  confined  to 
a  very  small  set  of  velocity  models.  For  example  one  might  consider  piecewise 
linear  interpolates  of  velocity  profiles  aligning  events  at  a  few  control  points. 
The  possibility  of  unintended  bias  is  obvious,  and  sometimes  damaging.  The 
synthetic  model  study  Versteeg  and  Grau  1991  [45]  provides  compelling  evidence 
of  the  limitations  of  these  methods. 

Another  approach  to  velocity  estimation  is  tomographic  inversion  or  travel 
time  analysis.  The  most  successful  instances  allow  the  velocity  model  an  a 
priori  unlimited  range  of  complexity,  subject  only  to  the  constraint  that  the 
model  explain  the  arrival  times  of  events  in  time  or  (migrated)  depth  sections 
(Bishop  et  al.  1985  [2],  Nolet  1987  [23],  Scales  1987  [25],  van  Trier  1990  [42], 
Stork  1992  [30]  and  references  cited  there).  The  various  measures  of  travel  time 
error  proposed  in  these  works  are  quite  well-behaved  numerically  -  i.e.  smooth 
and  (weakly)  convex  over  a  wide  range  of  velocity  models  -  because  they  access 
travel  time  directly.  Travel  time  is  a  quite  tame  function  of  velocities,  in  contrast 
to  the  waveform  (amplitude)  which  is  not,  as  mentioned  earlier.  The  large 
degree  of  nonuniqueness  inherent  in  the  tomographic  problem  formulation  may 
be  controlled  by  imposition  of  geologically  meaningful  constraints  (Stork  and 
Clayton  1991  [31]). 
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A  drawback  of  travel  time  inversion  is  that  travel  times  are  not  the  primary 
data  of  reflection  seismology  experiment,  and  must  be  picked  from  the  waveform 
data.  Travel  time  picking  is  most  difficult  in  exactly  those  situations  in  which 
tomographic  inversion  is  most  needed:  when  the  subsurface  structure  is  complex 
(Ehinger  1992  [10],  Harlan  1992  [15]).  The  possibility  of  bias  also  lurks  in  the 
multiplicity  of  events  that  are  not  picked  in  every  tomographic  exercise. 

A  third  approach  to  velocity  estimation  is  the  simplest  conceptually:  intro¬ 
duce  a  complete  physical  model  to  predict  the  waveform  data  in  detail.  Fol¬ 
lowing  convention  we  will  call  this  approach  inversion.  Most  recent  work  has 
concentrated  on  the  best  fit  formulation  of  inversion:  choose  a  numerical  measure 
of  error  between  the  data  and  the  model-based  synthetic,  and  adjust  the  model 
to  minimize  this  measure.  The  most  popular  error  measure  is  the  mean  square, 
because  of  its  pleasant  computational  properties,  though  other  measures  have 
been  used  as  well.  One  can  find  a  comprehensive  discussion  of  best  fit  inversion 
in  Tarantola  1987  [41],  including  a  specific  formulation  for  inversion  of  reflection 
seismograms  and  an  exposition  of  a  philosophical  justification  of  inversion  based 
on  Bayesian  inference. 

Differential  semblance  optimization  is  related  to  all  three  approaches  to  ve¬ 
locity  estimation.  It  is  a  variant  of  least-squares  inversion,  as  the  objective 
function  contains  a  mean  square  error  term  and  so  minimization  yields  an  in¬ 
version.  The  most  striking  difference  with  ordinary  least-squares  inversion  is 
that  the  model  space  is  expanded,  the  model-parameters  being  regarded  as  a 
priori  experiment-dependent.  Thus  each  experiment  (i.e.  each  shot,  in  the  or¬ 
dinary  organization  of  seismic  reflection  data)  is  explained  by  an  independent 
model.  It  turns  out  that  each  of  the  individual  data-fitting  problems  is  essen¬ 
tially  quadratic  (exactly  quadratic,  in  the  case  of  primaries-only  modeling  as  in 
the  sequel).  Therefore  production  of  an  experiment  dependent  suite  of  models 
fitting  independently  the  results  of  each  experiment  is  a  relatively  straightfor¬ 
ward  task. 

Of  course  such  a  multiplicity  of  models  is  infeasible,  or  incoherent:  the  earth 
is  unique  and  all  experiments  should  be  explained  by  a  single  model.  This  obser¬ 
vation  motivates  the  addition  of  a  penalty  for  variance  amongst  the  experiment- 
dependent  models  to  the  objective  function.  This  term  plays  the  same  role 
as  semblance  or  stack  power  in  migration  velocity  analysis,  which  measures 
the  alignment  of  migrated  images  of  the  same  depth  point.  However  differen¬ 
tial  semblance  optimization  uses  models,  which  should  be  identical,  rather  than 
migrated  images,  which  must  merely  be  aligned.  Since  models  produced  by 
inversion  presumably  do  not  contain  the  amplitude  artifacts  of  migrated  im¬ 
ages,  simple  mean-square  differences  of  neighboring  models  suffice  as  a  reliable 
indicator  of  semblance. 

Note  that  this  conclusion  requires  both  the  use  of  inverted  models  and  the 
application  of  a  differential  semh\a,nce  measure  to  them,  in  contrast  (and  in  par- 
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allel)  to  the  use  of  migrated  images  and  integral  or  global  semblance  measures  in 
migration  velocity  analysis.  Roelof  Versteeg  verified  in  his  1991  [44]  thesis  that 
mean-square  nearest-neighbor  differences  do  not  give  a  robust  velocity  indicator 
if  migrated  images  are  used. 

The  separation  into  velocity  and  reflectivity  submodels  is  a  feature  shared 
with  travel  time  tomography.  As  observed  by  Stork  and  Clayton  1991  [31],  it 
is  essential  in  tomography  to  update  both  parts  of  the  model  simultaneously. 
The  same  is  true  of  differential  semblance  optimization,  which  complicates  the 
calculation  of  the  objective  gradient.  The  resemblance  goes  deeper  than  this, 
however.  Analysis  of  the  differential  semblance  Hessian  shows  that  to  lowest 
order  the  control  over  the  velocity  is  exactly  the  same  as  that  of  a  properly 
constructed  tomographic  Hessian  (Song  and  Symes  1992  [29]).  Moreover,  the 
computed  differential  semblance  velocity  gradient  shown  in  Figure  18  has  exactly 
the  appearance  of  a  tomographic  velocity  gradient,  with  vertical  smearing  and 
side-lobes.  Compare  for  example  with  the  upper  left-hand  image  in  Figure  6  of 
Stork  and  Clayton  1992  [32],  which  has  similar  aperture  and  geometry. 

The  differential  semblance  criterion  -  that  neighboring  inverted  reflectivities 
be  the  same  -  is  the  precise  analog  for  full  waveform  models  of  the  “flat  im¬ 
age”  or  “minimum  reflector  variance”  criterion  used  in  migrated  data  domain 
tomography  (van  Trier  1990  [42],  Stork  1992  [30],  Harlan  1992  [15]). 

In  summary,  differential  semblance  optimization  appears  to  accomplish  to 
some  extent  the  goals  of  reflection  traveltime  tomography  without  event-picking, 
and  of  migration  velocity  analysis  without  requiring  either  sparsely  parame¬ 
terized  models  or  interactive  systematic  search.  It  yields  an  infeasible-point 
approach  to  least  squares  inversion,  and  can  be  implemented  efficiently  with 
accurate  gradient  calculation. 
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Appendix  A:  Computation  of  the  bilinear  oper¬ 
ator  B 


Computation  of  B[q,<j)]  is  a  straightforward  application  of  the  adjoint  state 
method,  as  introduced  in  this  context  by  Tarantola  1984  [39],  Lailly  1983  [18], 
and  others.  The  calculation  described  here  is  essentially  identical  to  that  given 
in  Chavent  and  Jacewitz  1990  [6]  for  calculation  of  the  gradient  of  a  function  of 
a  migrated  field  with  respect  to  velocity. 

The  field  q  here  is  a  reflectivity,  and  will  be  used  in  place  of  ;•  throughout  the 
calculations.  Recalling  the  wave  equations  1  and  2,  we  see  that  the  derivative 
DvF[v,q]6v  of  the  forward  map  with  respect  to  velocity  is  given  by 

D,,F[v,q]6v  -  6p{x,.,  Zr,t-,Xs) 


where 


-  V-)  =  2^  [V^-p  +  2qV^po]  +  2qVHpo 

6po,  Sp  =  0,t  <  0 


We  will  use  the  “usual”  adjoint  field,  i.e.  the  .solution  in  of  9  for  a  test  data 
section  d>.  Then: 

<  D„F[v,  (/]<5n,  d>  > 


Jdx,  J  dx  JdzJ  dt6p(x,z,f,x,) 


d>{Xr,t-,X,)6{x 


^  J  j  J  dtSp  w  (15) 

Because  all  fields  obey  Dirichlet  boundary  conditions  at  the  surface  and  are 
of  compact  support  in  x  for  each  t,  integration  by  parts  is  at  least  formally 
permissible.  We  assume  that  /  is  in  fact  bandlimited;  then  all  fields  are  in 
fact  smooth  away  from  the  source  point,  and  integration  by  parts  is  justified. 
In  discretization,  it  must  be  correct  that  the  discrete  Laplace  operator  and 
boundary  conditions  and  the  quadrature  rule  used  to  replace  the  integrations 
allow  exact  calculation  of  adjoints.  This  is  the  case,  for  example,  if  one  uses 
centered  difference  schemes,  Dirichlet  conditions  on  all  boundaries  implemented 
via  the  method  of  images,  and  the  trapezoidal  rule.  Then  the  discrete  operators 
are  also  self-adjoint,  and  summation  by  parts  goes  exactly  as  does  the  integration 
by  parts  which  we  present  here. 
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Integration  by  parts  and  use  of  the  equations  satisfied  by  6p  yields  the  con¬ 
tinuation  of  15: 


J  dx  j  dz  J  dt  ^2^  [V^p -f  29V^po]  +2q\/^6po  )  w 

=  J  dx  J  dz- — j  J 

dx,  J  dx  J  J  dt6p(iV~{qw) 


(16) 


where  in  the  last  term  we  have  performed  another  integration  by  parts. 

To  “get  6v  out”  of  the  second  term  in  the  last  field,  we  need  to  create  a 
secondary  backpropagated  wavefield  iuq  satisfying 


f  1  d^-wo 
Vu"  dr- 


=  2V'(9u>) 


Wq  —  0,  t  ^  ^inax 

with  homogeneous  boundary  conditions.  Then  substitution  of  the  left-hand  side 
of  the  Wo  equation  in  the  second  term  of  the  second  line  in  16,  integration  by 
parts,  and  use  of  the  equation  satisfied  by  Spo  gives  the  continuation  of  16: 

=  f  dx  f  dz^  f  dxs  f  dt  [(V-p -h  2f/VVo)  u> (V-po)  wo] 

=  <  6v,B[q,<f>]> 


From  this  last  equation  we  can  immediately  read  off  the  prescription  for  com¬ 
puting  B[q,(j)]  given  in  the  text. 
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Appendix  B:  Derivation  of  the  gradient  correc¬ 
tion 

In  this  appendix  we  justify  and  derive  the  gradient  correction  presented  in  the 
text.  The  basis  of  this  calculation  is  our  use  of  Chebyshev  iteration  to  produce 
the  reflectivity  estimate  r[r;]e,|.  We  begin  by  writing  the  iteration  in  its  normal 
three  term  recurrence  form: 

rk+i  =  Wi+i(/  -  sN[v])r,;  +  (1  -  u>k+i)rk-i  +  suJk+iM[v]Fdata 

We  write  a  prefix  6  as  usual  for  perturbation  (directional  derivative)  with  respect 
to  6v  of  all  of  the  quantities  in  this  recursion.  Note  that  the  spectral  bound  of 
Ar[i;]  is  a  stable  function  of  v,  since  A^[n]  is  Hermitian,  so  we  may  regard  the 
quantities  Uk  and  s  as  (locally)  independent  of  v.  We  obtain 

6rk+i  =  Wi+i(/ -  sjV[n])^ri.  +  (1  -  wr.+i)^ri._i 

+swk+ii^M[v]Fdata  -  SN[v]rk),k  -  1,2, ... 

The  key  to  understanding  the  relative  importance  of  the  pieces  of  this  formula 
is  that  as  a  migration  operator,  moves  events  around,  whereas  iVfn],  as  a 

simulation  operator  followed  by  a  migration  operator,  is  kinematically  neutral. 
Thus  6M[v]  involves  infinitesimal  phase  shifts  and  so  enhances  high-frequency 
content,  whereas  does  not.  Accordingly  we  write 


^rk  =  Vk  +  Zk 


where 


2/*,+!  =  uJk+\{I  -  sN[v])yk  +  (1  -  Uk+i)yk-i 

+suk+ii6M[v]Fdata  -  ^l^Mrkr„^^),k  =1,2, ... 

Zk+I  =  Wi.+i(/ -  sA[i;])zfc -K1  -  Wjt+i)2fc_i 

d-swfc+i((5A[n](rfc.„„^  -  rk)),k  -  1,2,  ... 

The  inhomogeneous  term  in  the  recursion  for  t/t  is  independent  of  k,  so  yk  is 
the  result  of  the  application  of  the  same  polynomial  as  is  r*: 

rk  -  Pk(N[v]}M[v]Fdata 
yk  =  PkiN[v]){SM[v]Fdata 

Otherwise  put,  we  can  compute  yk  by  re-running  the  iteration  which  produced 
rk,  with  different  data.  Moreover  the  data  for  yk  contains  hM[v]Fdata,  which 


32 


is  the  source  of  all  instability  in  Svk.  The  data  for  the  Zk  recursion,  on  the 
other  hand,  is  ^'-dependent,  which  renders  z^  essentially  uncomputable  from 
the  point  of  view  of  the  adjoint  calculations  to  come.  However  the  data  for  Zk 
involves  only  (5A^[?;],  which  is  tame,  and  —Vk  which  tends  to  become  small 

as  k  *■  kniaj; . 

Therefore  we  throw  Zk  away  and  replace  6rk  by  yk  in  the  second  term  of 
10,  which  also  involves  the  normal  residual  Cfc.  So  after  k  iterations  we  actually 
have 

DJk[v]6v  =<  F[t),  rjt]  -  Fdata  >  -  <  yk,ek  > 

Now 

<yk,ek  >  =  <  Pk(N[v]){6M[v]Fdata  -  J,  Sit  > 

=  <  6M[v]Fdata  -  ^M[n]F[«, 

-M[v]DyF[v,  rk^„^]6v,  PkiN[v])ek  > 

-  <  Pdata  -  F[v,  )■*„„],  D^F[v,  PkiN[v])ek]6v  > 

-  <  F[t),  Pr.(/V[v])efc]  > 

=  <  Sv,  B[Pk(N[v])€k,  Fdata  -  F[v, 

F[v,  Pk(N[v])ek]  > 

Here  we  have  used  the  definition  of  A^[e],  the  adjoint  relation 

<  F[t;,  9],  (i>  >=<  q,  M[v]<f>  > 

and  relations  between  derivatives  that  follow  from  these. 

Adding  the  naive  term  expressed  in  terms  of  B  as  before  and  setting  k  = 
kmax  we  get  for  gradient  in  the  sense  of  the  Sobolev  s-norm 

gradJ[v]  =  -  Pk^„{N[v])ek„^^^,F[v,rk^^^]  -  Fdata] 

F[t;,  Fj,„.„(yV[t;])ej.„.„] 

It  is  worth  listing  just  what  is  involved  in  this  formula: 

1.  one  round  of  Chebyshev  iteration  to  compute  the  reflectivity 

n-„.„  =  Pk„,.AN{v])M[v]Fdata 
the  residual  F[t»,  —  Fdata  and  the  normal  residual 

2.  another  round  of  Chebyshev  iteration  to  compute  field 

Pk„.„{N[v])ek„,,\ 

3.  two  generalized  adjoint  state  calculations  to  compute  the  values  of  B. 
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Table  1 

Performance  of  NLCG  over  8  Iterations 


_ "h _ Jj/  Jq 

0  0.2394E+01  O.lOOOE+01 

1  0.1337E+01  0.5587E+00 

2  0.1185E+01  0.4952E+00 

3  0.8021E+00  0.3351E+00 

4  0.6488E+00  0.2710E+00 

5  0.6178E+00  0.2581E+00 

6  0.5958E+00  0.2488E+00 

7  0.5736E+00  0.2396E+00 

8  0.5734E+00  0.2394E+00 


WgradJiW  \\gradJi\\/\\gradJo\\ 
0.7779E+00  O.lOOOE+01 

0.7746E-01  0.9958E-01 

0.3577E+00  0.4599E+00 

0.2324E+00  0.2987E+00 

0.2674E+00  0.3437E+00 

0.2176E-01  0.2797E-01 

0.5201E-01  0.6686E-01 

0.1981E-01  0.2546E-01 

0.9622E-02  0.1237E-01 


Table  2 

Assessment  of  Computed  Gradient  Accuracy 


Figure  Captions  -  GP  paper 


1.  A  cartoon  illustrating  the  smoothness  of  differential  semblance  as  a  function  of  ve¬ 
locity.  The  cube  repre.sents  the  prestack  inverted  reflectivity,  plotted  as  a  function 
of  depth,  (global)  horizontal  coordinate,  and  shot  location.  We  imagine  that  the 
(linearized)  inversion  has  been  performed  with  substantially  incorrect  velocity  in 
Figure  la  and  with  the  correct  velocity  in  Figure  lb.  Accordingly  the  common 
depth  point  gathers  (often  called  coherency  or  common  image  panels)  display  no¬ 
ticeable  moveout  for  the  incorrect  velocity  (a),  but  consist  of  identical  traces  for  the 
correct  velocity  (b).  In  reality  edge  effects  will  prevent  complete  short-independence 
in  the  correct  velocity  case.  The  relative  moveout  between  shot  gathers  (or  equally 
well  between  traces  on  a  common  depth  point  gather)  is  small,  amounting  to  much 
less  than  a  wavelength,  because  the  shot  sampling  is  fine.  Therefore  a  small  change 
in  velocity  will  result  in  a  small  change  in  the  differences  between  neighboring  traces 
on  the  common  depth  point  gather,  therefore  a  small  change  in  the  differential  sem¬ 
blance  as  defined  in  the  text.  The  stack  power,  on  the  other  hand,  may  change 
dramatically,  because  it  involves  the  interaction  of  distant  traces  which  may  be  en¬ 
tirely  out  of  phase. 


2.  The  velocity  (a)  and  reflectivity  (b)  profiles  v^xact  and  r^xact  used  in  the  numerical 
experiments  to  generate  the  layered  model.  Horizontal  scale  is  meters,  and  vertical 
scales  are  meters  per  millisecond  and  dimensionless,  respectively. 


3.  The  two-shot  data  set  generated  from  the  layered  model  in  Figure  2.  The  data 
has  been  muted  linearly  to  remove  the  water-bottom  event,  which  becomes  con¬ 
taminated  with  a  perturbational  head-wave  onset  at  larger  offsets.  Vertical  scale 
is  milliseconds.  Trace  spacingis  50  m,  with  near  offset  of  150  m  and  far  offset  of 
1800  m.  Shot  spacing  is  96  m. 


4.  Samples  of  the  differential  semblance  objective  function  along  the  line  segment 
between  the  constant  velocity  Wiase  =  1.5m/»ns  and  Veract-  The  segment  is  param¬ 
eterized  by 

^  ^base  -f  h.  +  ^6a5e] 

so  that  /)  =  0  is  constant  velocity  and  h  =  1  is  the  target  v exact-  The  curves  cor¬ 
respond  to  (7  =  1.0  (dots),  10.0  (dash-dots),  100.0  (dashes),  and  1000.0  (solid  line). 
Note  that  the  minimizer  is  located  in  all  cases  at  or  near  h  =  1.0,  and  that  the 
curves  appear  to  be  converging  as  a  — +  oo.  This  convergence  effect  is  peculiar  to 
2-shot  data  sets,  as  pointed  our  in  the  text. 


5.  The  result  of  four  nonlinear  conjugate  gradient  steps  (solid)  versus  Vexact  (dashes). 
Note  that  the  overshoot  from  the  surface  to  roughly  400  m  is  balanced  by  under¬ 
shoot  between  400  and  700  m.  Relatively  little  reflected  energy  is  present  in  the 
data  (Figure  3)  between  the  surface  and  700  m.  Scales  same  as  Figure  2a. 


6a.  The  reflectivity  gather  from  sliot  1,  corresponding  to  the  velocity  inversion  in  Figure 
5.  Note  that,  apart  from  edge  diffractions,  the  gather  is  quite  flat.  Vertical  scale  is 
meters,  trace  spacing  is  16  m. 

b.  The  reflectivity  gather  obtained  by  linearized  inversion  of  the  data  from  Figure  3 
at  the  target  velocity  Vexact-  Visually,  flatness  roughly  equivalent  to  that  displayed 
in  Figure  6a.  Inversion  performed  with  the  same  parameters  as  in  Figure  6a. 

c.  The  reflectivity  gather  obtained  by  linearized  inversion  of  the  data  from  Figure  3  at 
the  constant  velocity  Vba»e-  Note  the  marked  moveout.  Since  the  data  is  translation 
invariant,  this  single  shot  gather  is  actually  equivalent  to  a  common  depth  gather 
(common,  image  panel). 


7.  Data  predicted  from  the  inverted  model  (Figures  5,  6a). 


8.  Error  (difference  between  Figures  3  and  7),  plotted  on  same  amplitude  scale  as 
Figure  7.  Error  level  is  appro.ximately  15%  RMS. 


9.  Result  of  eight  nonlinear  conjugate  gradient  steps  (solid  line),  compared  with  the 
result  of  four  steps  (dashes)  and  target  (dash-dots). 


10.  Reflectivity  corresponding  to  the  velocity  inversion  in  Figure  9. 


11.  Velocities  obtained  after  four  steps  of  nonlinear  conjugate  gradient  iteration  using 
smoothing  lengths  of  300  m  (da.sh-dots),  400m  (solid),  500  m  (dashes),  and  target 
(dots). 


12a.  Reflectivity  gather  produced  by  inversion  at  the  300  m  velocity  model  (Figure  11, 
dash-dots). 

b.  Reflectivity  gather  produced  by  inversion  at  the  400  m  velocity  model  (Figure  11, 
solid). 

c.  Reflectivity  gather  produced  by  inversion  at  the  500  m  velocity  model  (Figure  11, 
dashes). 


13.  Grey-scale  plot  of  DSO  gradient  at  constant  velocity,  computed  for  a  35-shot  data 
set  generated  from  the  layered  model  (Figure  2).  Black  is  negative,  white  is  positive. 
Thus  this  gradient  suggests  a  general  increase  in  velocity,  essentially  layered  within 
the  inversion  aperture.  Edge  anomalies  occur  in  the  vicinities  of  the  first  and  last 
shots  in  the  line. 


14.  Several  vertical  cross  sections  of  the  gradient  in  Figure  13,  illustrating  the  extent 
to  which  it  is  layered. 


15.  Grey-scale  plot  of  layered  model  with  high  velocity  near  surface  anomaly. 


16.  Some  gathers  from  the  35-shot  data  set  derived  from  the  anomaly  model  of  Figure 
13  and  the  layered  reflectivity  of  Figure  2b. 


17.  Gradient  at  constant  background  velocity.  Evidence  of  nonlayering  present,  but  no 
clear  indication  of  its  form. 


18.  Gradient  at  layered  background  {=  v^xact  shown  in  Figure  2,  i.e.  without  the 
anomaly).  Location  and  type  of  the  anomaly  clearly  defined.  The  vertical  smearing 
and  side  lobes  are  characteristic  of  all  limited-aperture  tomographic  reconstruction 
methods. 


19.  Horizontal  cross  section  through  layered  background  gradient  (Figure  18)  at  240  m 
depth. 
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1(b)  REFLECTIVITY  INVERSION  AT  CORRECT  VELOCITY 
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